Welcome back to deep learning. Today we want to discuss the basics of reinforcement learning. We will look into how we can teach a system to play different games, and we will start with a first introduction to sequential decision making. So I have a couple of slides for you; you can see that the topic is reinforcement learning, and we want to go ahead and talk about sequential decision making. Later in this course we will also cover reinforcement learning in full detail, and we will look into deep reinforcement learning, but today we will only look at sequential decision making.
Okay, sequential decision making. Well, we want to play a couple of games, and the simplest game you can think of is one where you just pull a couple of levers. If you try to formalize this, you end up with the so-called multi-armed bandit problem.
So let's do a couple of definitions. We need some actions, and we formalize this as choosing an action A_t at time t from a set of actions, capital A. So this is a discrete set of possible actions that we can take, and choosing an action has consequences: if you choose the action A_t, then you will generate some reward R_t. But the relation between the action and the reward is probabilistic, which means that there is an unknown, typically different probability density function that describes the actual relation between each action and its reward. So if you think of your multi-armed bandit, you have a couple of slot machines, you pull one of the levers, and this generates some reward. Maybe all of these slot machines are the same, but probably they are not, so each arm that you could potentially pull has a different probability of generating some reward R_t.
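To make this concrete, here is a minimal sketch of such a bandit in Python; the class name, the number of arms, and the Gaussian reward model are illustrative assumptions, not something fixed by the lecture:

import numpy as np

class GaussianBandit:
    # A k-armed bandit: each arm pays a noisy reward around its own hidden mean.
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.means = self.rng.normal(0.0, 1.0, size=k)  # hidden from the player

    def pull(self, a):
        # The reward is stochastic: the same action can yield different rewards.
        return self.rng.normal(self.means[a], 1.0)

bandit = GaussianBandit()
r = bandit.pull(3)   # pull lever 3 and observe a reward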
Now, you want to be able to pick an action, and in order to do so we define a so-called policy. The policy is a way of formalizing how to choose an action; it is essentially also a probability density function that describes the likelihood of choosing some action, and it is the way in which we want to influence the game. So the policy is something that lies in our hands: we can define this policy, and of course we want to make it optimal with respect to playing the game.
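As a small, hypothetical illustration of a policy as a probability distribution over actions, consider sampling an action from a probability vector; the uniform choice below is only a placeholder, a good policy would shift its probabilities toward rewarding arms:

import numpy as np

def sample_action(policy, rng):
    # policy[a] is the probability of choosing action a; the entries sum to one.
    return rng.choice(len(policy), p=policy)

rng = np.random.default_rng(0)
policy = np.ones(10) / 10.0      # a uniform policy: every arm equally likely
a = sample_action(policy, rng)   # draw one action according to the policy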
So what's the key element? What do we want to achieve? We want to achieve maximum reward, and in particular we do not just want the maximum reward in every single time step of the game; instead, we want to maximize the expected reward over time. So we produce an estimate of the reward that is going to be obtained and compute a kind of mean value over it, because this allows us to estimate which actions produce what kind of rewards if we play the game for a long time.
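In the usual bandit notation, which I am assuming here (it is not spelled out in this part of the lecture), this objective can be written as choosing the action with the largest expected reward:

\[ q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right], \qquad a^{\ast} = \operatorname*{arg\,max}_{a \in \mathcal{A}} q(a) \]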
So this is a difference to supervised learning: here we are not told to do this action or do that action; instead, our training algorithm has to determine which actions to choose, and obviously we can make mistakes. The aim is then to choose the actions that will, over time, produce the maximum expected reward. So it is not so important if we lose in a single step, as long as on average we can still generate a high reward. The problem here, of course, is that the expected value of our reward is not known in advance. So this is the actual problem in reinforcement learning: we want to estimate this expected reward and the associated probabilities.
So what we can do is form R as a one-hot encoded vector which reflects which action from A actually caused the reward. If we do so, we can estimate the probability density function online by averaging, and we introduce this as the function Q(a), the so-called action-value function, which essentially changes with every new piece of information that we observe.
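A minimal sketch of this online averaging, assuming we simply keep a per-action reward sum and a per-action count (the variable names are mine):

import numpy as np

k = 10
reward_sum = np.zeros(k)   # accumulated reward per action
counts = np.zeros(k)       # how often each action has been chosen

def update(a, r):
    # Incorporate one observed (action, reward) pair.
    reward_sum[a] += r
    counts[a] += 1

def Q(a):
    # Current action-value estimate: the average reward observed for action a.
    return reward_sum[a] / counts[a] if counts[a] > 0 else 0.0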
So how can we do this? Well, there is an incremental way of computing Q_t(a), and we can show it quite easily. We define Q_{t+1}(a) as the sum of the rewards obtained over all time steps up to t, divided by t.
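Written as a formula (assuming, as the indexing here suggests, that R_1 through R_t are the rewards obtained for this action a):

\[ Q_{t+1}(a) = \frac{1}{t} \sum_{i=1}^{t} R_i \]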
Now we can show that this can be split up: we take out the last element of the sum, which is R_t, and let the remaining sum run only from 1 to t minus 1. If we do so, we can also introduce a factor of t minus 1 and at the same time divide by t minus 1; this cancels out to one, so the remaining part is just the average of the first t minus 1 rewards, which is our previous estimate Q_t(a).
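Carrying this splitting step through gives the standard incremental form; this is the result the derivation is heading towards, written in the same notation:

\[ Q_{t+1}(a) = \frac{1}{t}\left( R_t + \sum_{i=1}^{t-1} R_i \right) = \frac{1}{t}\left( R_t + (t-1) \cdot \frac{1}{t-1}\sum_{i=1}^{t-1} R_i \right) = \frac{1}{t}\left( R_t + (t-1)\, Q_t(a) \right) = Q_t(a) + \frac{1}{t}\left( R_t - Q_t(a) \right) \]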
Deep Learning - Reinforcement Learning Part 1
This video explains the concepts of sequential decision making and the multi-armed bandit problem.